Loans_full_schema dataset :
| emp_title | emp_length | state | homeownership | annual_income | verified_income | debt_to_income | annual_income_joint | verification_income_joint | debt_to_income_joint | ... | sub_grade | issue_month | loan_status | initial_listing_status | disbursement_method | balance | paid_total | paid_principal | paid_interest | paid_late_fees | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | global config engineer | 3.0 | NJ | MORTGAGE | 90000.0 | Verified | 18.01 | NaN | NaN | NaN | ... | C3 | Mar-18 | Current | whole | Cash | 27015.86 | 1999.33 | 984.14 | 1015.19 | 0.0 |
| 1 | warehouse office clerk | 10.0 | HI | RENT | 40000.0 | Not Verified | 5.04 | NaN | NaN | NaN | ... | C1 | Feb-18 | Current | whole | Cash | 4651.37 | 499.12 | 348.63 | 150.49 | 0.0 |
| 2 | assembly | 3.0 | WI | RENT | 40000.0 | Source Verified | 21.15 | NaN | NaN | NaN | ... | D1 | Feb-18 | Current | fractional | Cash | 1824.63 | 281.80 | 175.37 | 106.43 | 0.0 |
| 3 | customer service | 1.0 | PA | RENT | 30000.0 | Not Verified | 10.16 | NaN | NaN | NaN | ... | A3 | Jan-18 | Current | whole | Cash | 18853.26 | 3312.89 | 2746.74 | 566.15 | 0.0 |
| 4 | security supervisor | 10.0 | CA | RENT | 35000.0 | Verified | 57.96 | 57000.0 | Verified | 37.66 | ... | C3 | Mar-18 | Current | whole | Cash | 21430.15 | 2324.65 | 1569.85 | 754.80 | 0.0 |
5 rows × 55 columns
Description :
| emp_length | annual_income | debt_to_income | annual_income_joint | debt_to_income_joint | delinq_2y | months_since_last_delinq | earliest_credit_line | inquiries_last_12m | total_credit_lines | ... | public_record_bankrupt | loan_amount | term | interest_rate | installment | balance | paid_total | paid_principal | paid_interest | paid_late_fees | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9183.000000 | 1.000000e+04 | 9976.000000 | 1.495000e+03 | 1495.000000 | 10000.00000 | 4342.000000 | 10000.00000 | 10000.00000 | 10000.000000 | ... | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 5.930306 | 7.922215e+04 | 19.308192 | 1.279146e+05 | 19.979304 | 0.21600 | 36.760709 | 2001.29000 | 1.95820 | 22.679600 | ... | 0.123800 | 16361.922500 | 43.272000 | 12.427524 | 476.205323 | 14458.916610 | 2494.234773 | 1894.448466 | 599.666781 | 0.119516 |
| std | 3.703734 | 6.473429e+04 | 15.004851 | 7.016838e+04 | 8.054781 | 0.68366 | 21.634939 | 7.79551 | 2.38013 | 11.885439 | ... | 0.337172 | 10301.956759 | 11.029877 | 5.001105 | 294.851627 | 9964.561865 | 3958.230365 | 3884.407175 | 517.328062 | 1.813468 |
| min | 0.000000 | 0.000000e+00 | 0.000000 | 1.920000e+04 | 0.320000 | 0.00000 | 1.000000 | 1963.00000 | 0.00000 | 2.000000 | ... | 0.000000 | 1000.000000 | 36.000000 | 5.310000 | 30.750000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2.000000 | 4.500000e+04 | 11.057500 | 8.683350e+04 | 14.160000 | 0.00000 | 19.000000 | 1997.00000 | 0.00000 | 14.000000 | ... | 0.000000 | 8000.000000 | 36.000000 | 9.430000 | 256.040000 | 6679.065000 | 928.700000 | 587.100000 | 221.757500 | 0.000000 |
| 50% | 6.000000 | 6.500000e+04 | 17.570000 | 1.130000e+05 | 19.720000 | 0.00000 | 34.000000 | 2003.00000 | 1.00000 | 21.000000 | ... | 0.000000 | 14500.000000 | 36.000000 | 11.980000 | 398.420000 | 12379.495000 | 1563.300000 | 984.990000 | 446.140000 | 0.000000 |
| 75% | 10.000000 | 9.500000e+04 | 25.002500 | 1.515455e+05 | 25.500000 | 0.00000 | 53.000000 | 2006.00000 | 3.00000 | 29.000000 | ... | 0.000000 | 24000.000000 | 60.000000 | 15.050000 | 644.690000 | 20690.182500 | 2616.005000 | 1694.555000 | 825.420000 | 0.000000 |
| max | 10.000000 | 2.300000e+06 | 469.090000 | 1.100000e+06 | 39.980000 | 13.00000 | 118.000000 | 2015.00000 | 29.00000 | 87.000000 | ... | 3.000000 | 40000.000000 | 60.000000 | 30.940000 | 1566.590000 | 40000.000000 | 41630.443680 | 40000.000000 | 4216.440000 | 52.980000 |
8 rows × 42 columns
Anomaly Description:
The Dataset has records with NULL values as an anomaly. These records are filled with '0' for quantitative data.
For qualitative data, NULL values can be filled with meaningfull information.
Before replacing NULLs, the count of rows is 10000
After replacing NULLs, the count of rows is 10000
Dataset after NULLs replacement:
| emp_title | emp_length | state | homeownership | annual_income | verified_income | debt_to_income | annual_income_joint | verification_income_joint | debt_to_income_joint | ... | sub_grade | issue_month | loan_status | initial_listing_status | disbursement_method | balance | paid_total | paid_principal | paid_interest | paid_late_fees | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9995 | owner | 10.0 | TX | RENT | 108000.0 | Source Verified | 22.28 | 0.0 | NA | 0.00 | ... | A4 | Jan-18 | Current | whole | Cash | 21586.34 | 2969.80 | 2413.66 | 556.14 | 0.0 |
| 9996 | director | 8.0 | PA | MORTGAGE | 121000.0 | Verified | 32.38 | 0.0 | NA | 0.00 | ... | D3 | Feb-18 | Current | whole | Cash | 9147.44 | 1456.31 | 852.56 | 603.75 | 0.0 |
| 9997 | toolmaker | 10.0 | CT | MORTGAGE | 67000.0 | Verified | 45.26 | 107000.0 | Source Verified | 29.57 | ... | E2 | Feb-18 | Current | fractional | Cash | 27617.65 | 4620.80 | 2382.35 | 2238.45 | 0.0 |
| 9998 | manager | 1.0 | WI | MORTGAGE | 80000.0 | Source Verified | 11.99 | 0.0 | NA | 0.00 | ... | A1 | Feb-18 | Current | whole | Cash | 21518.12 | 2873.31 | 2481.88 | 391.43 | 0.0 |
| 9999 | operations analyst | 3.0 | CT | RENT | 66000.0 | Not Verified | 20.82 | 0.0 | NA | 0.00 | ... | B4 | Feb-18 | Current | whole | Cash | 11574.83 | 1658.56 | 1225.17 | 433.39 | 0.0 |
5 rows × 55 columns
The depicted heatmap represents the correlation between each feature. In this, we can view that there are close relationships between the following:
debt_to_income and annual_income_joint
open_credit_lines and num_satisfactory_accounts
current_accounts_delinq and num_accounts_30d_past_due
This chart illustrates loan taken state wise in the USA. Hovering over the chart will show the count of loans in that particular state.
Status of the Loan is of 6 types. In the below pie chart, we can view each combination of types and their split percentage. As a Lending Club, we might be more interested in noting the defaulters percentage. This can be done by just selecting the two 'Late' categories from legends.
The population of loan requestors are classified into 7 grades (A to G). Wherein, Grade-G is the highest risk category who will have to pay more interest and Grade-A is the lowest risk category. Amongst these 7 categories, there are 5 sub-grading categories. The depicted stack will represent each grade with it's respective sub-grades splitted based on the available dataset.
Dataset is split into No-Risk, Low, Medium and High risk categories based on the column "num_historical_failed_to_pay". The population of loan requestors are being classified as 'Low' risk when the number of times they haved failed to re-pay the loan is less in the past.
In the portrayed chart, we can view that interest rate is strictly within a lower range for No-Risk population. Whereas, the rate of interest varies more for the other three Risk categories.
Response variable to be measured is interest rate (Continuos variable). We can use a Linear Regression model to measure a continuos variable.
In our dataset, there are a few categorical variables to be dealt with when building a Linear Regression model. We can use two methods to cleanse our dataset,
(1) Factorizing the categorical varibale
(2) One-hot encoding method
The accuracy of the model increases to 99.97% from 74.19% on using On-hot encoding method.
Model Score when using factorize method is 74.1952665370125 %
Model Score when using One-hot encoding method is 99.9739423304772 %
Result Visualisation:
The predictions of the test and train dataset allign to form a linear line to show the goodness of the model.
As One-hot encoding is a bteer approach. We continue to use the same train and test dataset for our second model. The predictions of the test and train dataset allign to form a Linear line to show the goodness of the model.
RandomForest model score on using One-hot encoding method is 99.99284801321079 %
The results of the RandomForest model is as below:
Given more time, I would have been able to analyse the data more deeper, and build better models and experiment different approaches. Also, I can improve the co-efficients and used boosting methods. In-depth analysis of the data would help me provide more informative visualizations.